Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifier. (Wikipedia)
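To make the closed-form idea concrete: the classifier we build below scores a message $x = (x_1, \dots, x_n)$ for each class $y$ using Bayes' rule together with the "naive" assumption that features are conditionally independent given the class:

$$P(y \mid x) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

Training only requires counting word occurrences per class, which is where the linear-time, closed-form estimate comes from.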
Use case - Spam filter
We will use the Naive Bayes algorithm to fit a spam filter.
Spam filters are used in all email services to classify received emails as "Spam" or "Not Spam". A simple approach involves maintaining a vocabulary of words that commonly occur in "Spam" emails and classifying an email as "Spam" if the number of words from that vocabulary present in the email exceeds a certain threshold.
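As a warm-up, here is a minimal sketch of that threshold rule. The vocabulary and cutoff below are illustrative assumptions, not part of the dataset used later:

```python
# Hypothetical threshold-based filter: count how many vocabulary words
# appear in the email and flag it as spam past a fixed cutoff.
SPAM_VOCAB = {'offer', 'million', 'dollar', 'secret'}  # illustrative only
THRESHOLD = 2  # illustrative cutoff

def is_spam_by_threshold(email: str) -> bool:
    words = email.lower().split()
    hits = sum(1 for w in words if w in SPAM_VOCAB)
    return hits >= THRESHOLD

print(is_spam_by_threshold("million dollar offer"))  # True: 3 vocabulary hits
print(is_spam_by_threshold("sports is healthy"))     # False: 0 hits
```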
Assume we are given a vocabulary consisting of 15 words:
$V$ = {secret, offer, low, price, valued, customer, today, dollar, million, sports, is, for, play, healthy, pizza}
We will use $V_i$ to represent the $i$th word in $V$. As our training dataset, we are also given 7 example messages, the first 3 spam and the last 4 not spam:
```python
# The vocabulary V as a Python list (in the order given above)
V = ['secret', 'offer', 'low', 'price', 'valued', 'customer', 'today',
     'dollar', 'million', 'sports', 'is', 'for', 'play', 'healthy', 'pizza']

message = ['million dollar offer', 'secret offer today', 'secret is secret',
           'low price for valued customer', 'play secret sports today',
           'sports is healthy', 'low price pizza']
```
`train` holds our input vectors $x$, one per training message, each of length $n = 15$ (the length of $V$). Since we have 7 training messages, we get a 7 by 15 training matrix - 7 examples, 15 features.
```python
import numpy as np

train = np.zeros((7, 15))
label = [0, 0, 0, 1, 1, 1, 1]  # 0 = spam, 1 = not spam
```
```python
'''
Converting the 7 training messages into x vectors of length n = 15
'''
for i in range(len(train)):
    for j in range(len(V)):
        for k in range(len(message[i].split(" "))):
            if V[j] == message[i].split(" ")[k]:
                train[i, j] += 1
```
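The triple loop above works, but the same bag-of-words matrix can be built more directly. A sketch of an equivalent construction using `collections.Counter` (same `V`, `message`, and `train` as above):

```python
from collections import Counter

train_alt = np.zeros((7, 15))
for i, msg in enumerate(message):
    counts = Counter(msg.split(" "))           # word -> occurrence count
    for j, word in enumerate(V):
        train_alt[i, j] = counts.get(word, 0)  # 0 if the word is absent

assert (train_alt == train).all()  # same matrix as the loop version
```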
```python
'''
This is the feature matrix x.
row = one message, col = count of the i_th vocabulary word
'''
train
```
```python
for i in range(len(label)):
    if label[i] == 0:
        print("this is {}th message - It's SPAM".format(i))
        print(train[i])
    else:
        print("this is {}th message - It's NOT SPAM".format(i))
        print(train[i])
```
```
this is 0th message - It's SPAM
[0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
this is 1th message - It's SPAM
[1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
this is 2th message - It's SPAM
[2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
this is 3th message - It's NOT SPAM
[0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
this is 4th message - It's NOT SPAM
[1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0.]
this is 5th message - It's NOT SPAM
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
this is 6th message - It's NOT SPAM
[0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
```
```python
'''
spam and ham will hold, for each word x_i, its total count over the
spam (y=0) and not-spam (y=1) training messages respectively.
'''
spam = np.zeros((1, 15))
ham = np.zeros((1, 15))

for c in range(len(label)):
    for i in range(train.shape[1]):
        if label[c] == 0:  # Spam
            spam[0, i] += train[c, i]
        else:              # Not Spam
            ham[0, i] += train[c, i]
```
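As an aside, the same per-class word counts can be obtained without the explicit double loop; a sketch using NumPy boolean indexing over the arrays defined above:

```python
labels = np.array(label)

# Sum the rows of `train` belonging to each class.
spam_alt = train[labels == 0].sum(axis=0, keepdims=True)
ham_alt = train[labels == 1].sum(axis=0, keepdims=True)

assert (spam_alt == spam).all() and (ham_alt == ham).all()
```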
Looking at the result, spam[0,0] = 3, spam[0,1] = 2, spam[0,2] = 0: the total count of x_0 = "secret" is three among the training messages classified as spam, we have two occurrences of "offer", and zero of "low".
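For a quick sanity check, we can print both count arrays; the expected values follow from the feature matrix shown earlier:

```python
print(spam)  # expected: [[3. 2. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0.]]
print(ham)   # expected: [[1. 0. 2. 2. 1. 1. 1. 0. 0. 2. 1. 1. 1. 1. 1.]]
```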
```python
'''
In this section, we estimate the probability of each word x_i given
that a message is spam or not:
prob_spam_x_i = P(x_i|y=0)
prob_ham_x_i = P(x_i|y=1)
'''
spam_counter = 0
ham_counter = 0

for c in range(len(label)):
    if label[c] == 0:
        spam_counter += 1
    else:
        ham_counter += 1

prob_spam_x_i = spam / spam_counter
prob_ham_x_i = ham / ham_counter

# Class priors P(y=0) and P(y=1), used by the classifier below
prob_spam = spam_counter / len(label)
prob_ham = ham_counter / len(label)

# Per-word posteriors P(y=0|x_i) and P(y=1|x_i), also used below
prob_spam_per_word = spam / (spam + ham)
prob_ham_per_word = ham / (spam + ham)
```
The result above gives, for each word $x_i$, the probability that a message is spam given that word. For example, the probability of spam given "secret" is $3/4 = 75\%$, and the probability of spam given "offer" is $100\%$.
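One caveat worth noting: several words (e.g. "low") never occur in spam, so their per-word spam probability is exactly zero, and a single such word zeroes out the whole product in the classifier below. A common remedy, not used in this post, is Laplace (add-one) smoothing; a sketch:

```python
# Hypothetical smoothed variant: add one pseudo-count per class so that
# no per-word probability is exactly 0 or 1.
prob_spam_per_word_smooth = (spam + 1) / (spam + ham + 2)
prob_ham_per_word_smooth = (ham + 1) / (spam + ham + 2)
```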
```python
prob_ham_per_word  # This is the probability of ham (not spam) per word.
```
```python
threshold = 0.50  # Above what probability do you want to classify an email as spam?

for t in range(train.shape[0]):
    conditional_prob_spam = prob_spam
    conditional_prob_ham = prob_ham
    for i in range(prob_spam_per_word.shape[1]):
        if train[t][i] == 1:
            conditional_prob_spam = conditional_prob_spam * prob_spam_per_word[0, i]
            conditional_prob_ham = conditional_prob_ham * prob_ham_per_word[0, i]
    if conditional_prob_spam != 0:
        prob = conditional_prob_spam / (conditional_prob_spam + conditional_prob_ham) * 100
    else:
        prob = 0.0
    # Use a fresh name for the prediction so the training `label` list
    # is not overwritten inside the loop.
    if prob > threshold * 100:
        pred = "SPAM"
    else:
        pred = "NOT SPAM"
    print("{} th email is {} with a probability of being spam {}%".format(t, pred, prob))
```
```
0 th email is SPAM with a probability of being spam 100.0%
1 th email is SPAM with a probability of being spam 100.0%
2 th email is NOT SPAM with a probability of being spam 42.857142857142854%
3 th email is NOT SPAM with a probability of being spam 0.0%
4 th email is NOT SPAM with a probability of being spam 0.0%
5 th email is NOT SPAM with a probability of being spam 0.0%
6 th email is NOT SPAM with a probability of being spam 0.0%
```
Given a new message "today is secret", decide whether it is spam or not spam, based on the Naive Bayes classifier learned from the above data.
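One way to check the answer, reusing the quantities computed above (the feature vector is built with the same split-and-match scheme as the training data):

```python
new_message = "today is secret"

# Build the feature vector for the new message
x_new = np.zeros((1, 15))
for j, word in enumerate(V):
    x_new[0, j] = new_message.split(" ").count(word)

# Score both classes exactly as the classification loop above does
p_spam = prob_spam
p_ham = prob_ham
for i in range(15):
    if x_new[0, i] == 1:
        p_spam *= prob_spam_per_word[0, i]
        p_ham *= prob_ham_per_word[0, i]

prob = p_spam / (p_spam + p_ham) * 100
print("probability of spam: {:.2f}%".format(prob))  # roughly 69.23% -> SPAM
```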